We first read in two data sets, “income” and “life”, which hold income per person and life expectancy by country across many years. “income” has 193 observations and 220 variables, while “life” has 187 observations and 220 variables.
library(tidyr)   # pivot_longer()
library(plotly)  # plot_ly() and layout()
income <- read.csv("https://ecoleman451.github.io/ecoleman/w6/income_per_person.csv")
life <- read.csv("https://ecoleman451.github.io/ecoleman/w6/life_expectancy_years.csv")
Next, we reshape both data sets from wide to long format so that each has only three columns (geo, year, and either income or life expectancy):
# Reshape data set such that there are only three columns (Geo, Year, & Income)
new_income <- pivot_longer(income, cols = -geo, names_to = "year", values_to = "income")
head(new_income)
# A tibble: 6 × 3
geo year income
<chr> <chr> <int>
1 Afghanistan X1800 603
2 Afghanistan X1801 603
3 Afghanistan X1802 603
4 Afghanistan X1803 603
5 Afghanistan X1804 603
6 Afghanistan X1805 603
new_life <- pivot_longer(life, cols = -geo, names_to = "year", values_to = "life.expectancy")
head(new_life)
# A tibble: 6 × 3
geo year life.expectancy
<chr> <chr> <dbl>
1 Afghanistan X1800 28.2
2 Afghanistan X1801 28.2
3 Afghanistan X1802 28.2
4 Afghanistan X1803 28.2
5 Afghanistan X1804 28.2
6 Afghanistan X1805 28.2
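The year values above carry an "X" prefix because read.csv() makes column names syntactically valid, and pivot_longer() then copies those names into the year column. The following is a minimal sketch of one way to turn them into integers; the data frame `toy` and its values are made up for illustration and are not part of the original data. If the year column were converted like this in the real pipeline, any later filter would have to compare against 2015 rather than "X2015".

```r
library(tidyr)

# Toy wide table in the same shape as the income file (values are made up)
toy <- data.frame(
  geo   = c("A", "B"),
  X1800 = c(600L, 700L),
  X1801 = c(610L, 710L)
)

# Wide to long: one row per (geo, year) pair
toy_long <- pivot_longer(toy, cols = -geo, names_to = "year", values_to = "income")

# Strip the leading "X" and convert the remainder to an integer year
toy_long$year <- as.integer(sub("^X", "", toy_long$year))

toy_long
```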
We then merge these two new sets into a data set called “LifeExpIncom” which now contains Geo, Year, Income, & Life Expectancy (40953 observations and 4 variables).
## Create new data set
LifeExpIncom <- merge(new_life, new_income, by = c("geo", "year"))
head(LifeExpIncom)
geo year life.expectancy income
1 Afghanistan X1800 28.2 603
2 Afghanistan X1801 28.2 603
3 Afghanistan X1802 28.2 603
4 Afghanistan X1803 28.2 603
5 Afghanistan X1804 28.2 603
6 Afghanistan X1805 28.2 603
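By default, merge() performs an inner join, so a (geo, year) pair survives only if it appears in both tables. The 40953 rows reported above equal 187 × 219, the number of countries in “life” times the number of year columns, which suggests that countries present only in “income” were dropped. The sketch below illustrates this behavior on two made-up toy tables (`life_toy` and `income_toy` are hypothetical names), comparing the default with a full outer join.

```r
# Toy long tables; country "B" is only in life_toy, country "C" only in income_toy
life_toy <- data.frame(
  geo = c("A", "A", "B"),
  year = c("X1800", "X1801", "X1800"),
  life.expectancy = c(30, 31, 40)
)
income_toy <- data.frame(
  geo = c("A", "A", "C"),
  year = c("X1800", "X1801", "X1800"),
  income = c(600L, 610L, 900L)
)

# Default merge(): inner join, unmatched rows are dropped
inner <- merge(life_toy, income_toy, by = c("geo", "year"))

# all = TRUE: full outer join, unmatched rows are kept and filled with NA
outer <- merge(life_toy, income_toy, by = c("geo", "year"), all = TRUE)

nrow(inner)
nrow(outer)
```

Comparing the two row counts is a quick way to see how many rows an inner join silently discards.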
We then read in two more data sets: “country” (240 observations and 11 variables), which holds country metadata, and “pop” (195 observations and 220 variables), which holds population totals. We reshape “pop” into the same long format as “LifeExpIncom”, with one row per country and year. “country” has no year columns, so it needs no reshaping.
## Read in More Data
country <- read.csv("https://ecoleman451.github.io/ecoleman/w6/countries_total.csv")
pop <- read.csv("https://ecoleman451.github.io/ecoleman/w6/population_total.csv")
new_pop <- pivot_longer(pop, cols = -geo, names_to = "year", values_to = "population")
head(new_pop)
# A tibble: 6 × 3
geo year population
<chr> <chr> <int>
1 Afghanistan X1800 3280000
2 Afghanistan X1801 3280000
3 Afghanistan X1802 3280000
4 Afghanistan X1803 3280000
5 Afghanistan X1804 3280000
6 Afghanistan X1805 3280000
## Merge LifeExpIncom with Country
merged <- merge(LifeExpIncom, country, by.x = "geo", by.y = "name", all.x = TRUE)
head(merged)
geo year life.expectancy income alpha.2 alpha.3 country.code
1 Afghanistan X1817 28.0 604 AF AFG 4
2 Afghanistan X1815 28.1 604 AF AFG 4
3 Afghanistan X1812 28.1 604 AF AFG 4
4 Afghanistan X1814 28.1 604 AF AFG 4
5 Afghanistan X1811 28.1 604 AF AFG 4
6 Afghanistan X1816 28.1 604 AF AFG 4
iso_3166.2 region sub.region intermediate.region region.code
1 ISO 3166-2:AF Asia Southern Asia 142
2 ISO 3166-2:AF Asia Southern Asia 142
3 ISO 3166-2:AF Asia Southern Asia 142
4 ISO 3166-2:AF Asia Southern Asia 142
5 ISO 3166-2:AF Asia Southern Asia 142
6 ISO 3166-2:AF Asia Southern Asia 142
sub.region.code intermediate.region.code
1 34 NA
2 34 NA
3 34 NA
4 34 NA
5 34 NA
6 34 NA
Having merged “LifeExpIncom” with “country”, we then merge the result with the reshaped “pop” data, creating a set called “fin_data” (42705 observations and 15 variables).
## Merge Population with Merged Data
fin_data <- merge(new_pop, merged, by = c("geo", "year"), all.x = TRUE)
head(fin_data)
geo year population life.expectancy income alpha.2 alpha.3
1 Afghanistan X1800 3280000 28.2 603 AF AFG
2 Afghanistan X1801 3280000 28.2 603 AF AFG
3 Afghanistan X1802 3280000 28.2 603 AF AFG
4 Afghanistan X1803 3280000 28.2 603 AF AFG
5 Afghanistan X1804 3280000 28.2 603 AF AFG
6 Afghanistan X1805 3280000 28.2 603 AF AFG
country.code iso_3166.2 region sub.region intermediate.region
1 4 ISO 3166-2:AF Asia Southern Asia
2 4 ISO 3166-2:AF Asia Southern Asia
3 4 ISO 3166-2:AF Asia Southern Asia
4 4 ISO 3166-2:AF Asia Southern Asia
5 4 ISO 3166-2:AF Asia Southern Asia
6 4 ISO 3166-2:AF Asia Southern Asia
region.code sub.region.code intermediate.region.code
1 142 34 NA
2 142 34 NA
3 142 34 NA
4 142 34 NA
5 142 34 NA
6 142 34 NA
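Both merges above use all.x = TRUE, which keeps every row of the first table and fills the columns from the second table with NA wherever no match exists. Because the country metadata is joined on the country name, any spelling difference between the two sources produces such NA rows. A minimal sketch of how one might list the unmatched names before merging follows; the tables `data_toy` and `country_toy` and the names in them are made up for illustration and are not claimed to be the actual mismatches in these files.

```r
# Toy tables with deliberately different spellings for two countries
data_toy <- data.frame(
  geo = c("Bolivia", "Chad", "Russia"),
  income = c(1L, 2L, 3L)
)
country_toy <- data.frame(
  name = c("Bolivia (Plurinational State of)", "Chad", "Russian Federation"),
  region = c("Americas", "Africa", "Europe")
)

# Names in the data that have no exact counterpart in the country table
unmatched <- setdiff(data_toy$geo, country_toy$name)
unmatched

# Left join: unmatched rows are kept, with NA in the metadata columns
joined <- merge(data_toy, country_toy, by.x = "geo", by.y = "name", all.x = TRUE)
sum(is.na(joined$region))
```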
All that is left is to subset the data to the year 2015, which gives us “final_data” (195 observations and 15 variables). First, though, let’s look at the summary statistics for “fin_data”, which covers every year in the data set, not just 2015.
## Get Data for Year 2015
final_data <- subset(fin_data, year == "X2015")
summary(fin_data)
geo year population life.expectancy
Length:42705 Length:42705 Min. :6.420e+02 Min. : 1.00
Class :character Class :character 1st Qu.:2.830e+05 1st Qu.:31.20
Mode :character Mode :character Median :1.710e+06 Median :35.50
Mean :1.298e+07 Mean :43.13
3rd Qu.:5.940e+06 3rd Qu.:56.00
Max. :1.420e+09 Max. :84.20
NA's :2268
income alpha.2 alpha.3 country.code
Min. : 247 Length:42705 Length:42705 Min. : 4.0
1st Qu.: 875 Class :character Class :character 1st Qu.:208.0
Median : 1440 Mode :character Mode :character Median :418.0
Mean : 4591 Mean :424.9
3rd Qu.: 3460 3rd Qu.:643.0
Max. :178000 Max. :894.0
NA's :1752 NA's :4599
iso_3166.2 region sub.region intermediate.region
Length:42705 Length:42705 Length:42705 Length:42705
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
region.code sub.region.code intermediate.region.code
Min. : 2.00 Min. : 15.0 Min. : 5.00
1st Qu.: 2.00 1st Qu.: 54.0 1st Qu.:11.00
Median : 19.00 Median :154.0 Median :14.00
Mean : 71.74 Mean :177.9 Mean :14.89
3rd Qu.:142.00 3rd Qu.:202.0 3rd Qu.:17.00
Max. :150.00 Max. :419.0 Max. :29.00
NA's :4599 NA's :4599 NA's :26061
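The NA's rows in the summary show that several columns have missing values. These counts can also be tallied directly, which is easier to scan than the full summary. The following is a minimal sketch on a made-up toy data frame (`toy` is a hypothetical stand-in, not the real “fin_data”):

```r
# Toy data frame with a few missing values (values are made up)
toy <- data.frame(
  geo = c("A", "B", "C"),
  income = c(600L, NA, 900L),
  life.expectancy = c(30, NA, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Rows with no missing value in any column
complete <- toy[complete.cases(toy), ]
nrow(complete)
```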
scatter_plot <- plot_ly(
data = final_data,
x = ~income,
y = ~life.expectancy,
size = ~population,
color = ~geo,
text = ~paste("Country: ", geo, "<br>Population: ", population),
type = "scatter",
mode = "markers",
marker = list(
opacity = 0.6, ## Transparency level
sizemode = "diameter", ## Set the size mode to diameter
sizeref = 0.1, ## Adjust the size reference for better visibility
line = list(
color = "black", ## Boundary color for points
width = 1 ## Boundary width
)
)
)
## Add the title and axis labels. The options are passed to layout() as named
## arguments; an unnamed list would not be read as layout attributes.
scatter_plot <- layout(
scatter_plot,
title = "Association Between Life Expectancy and Income (Year 2015)",
xaxis = list(title = "Income"),
yaxis = list(title = "Life Expectancy"),
showlegend = FALSE ## Hide legend for individual countries
)
## Display the interactive scatter plot
scatter_plot
The above plot shows the relationship between income, life expectancy, and population size across countries in the year 2015. Each point is a country, and the size of each point reflects that country’s population. Each country also has its own color.
The x-axis shows income, so countries with higher incomes appear farther to the right. The y-axis shows life expectancy, so countries with higher life expectancy appear higher up. A few countries visually dominate the scatter plot, depending on their population size and income. The plot lets us examine whether countries with higher incomes generally have longer life expectancies, and whether population size is associated with higher or lower income levels.
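One way to put a number on the income and life expectancy relationship is a correlation coefficient. The sketch below computes a Spearman rank correlation, which only assumes a monotonic relationship, and a Pearson correlation on log10 income, since income values span several orders of magnitude. The data frame `toy_2015` and its values are made up for illustration; they are not results from “final_data”.

```r
# Toy stand-in for the 2015 subset (values are made up)
toy_2015 <- data.frame(
  geo = c("A", "B", "C", "D", "E"),
  income = c(800, 2500, 9000, 30000, 60000),
  life.expectancy = c(58, 66, 72, 79, 82)
)

# Spearman rank correlation; complete.obs drops rows with missing values
rho <- cor(toy_2015$income, toy_2015$life.expectancy,
           method = "spearman", use = "complete.obs")

# Pearson correlation between log10(income) and life expectancy
r_log <- cor(log10(toy_2015$income), toy_2015$life.expectancy,
             use = "complete.obs")

rho
r_log
```

On the real data, the same two calls could be applied to the income and life.expectancy columns of “final_data”.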